Dimension Reduction

A Quick Tour with Examples

USM Data Science

2026-02-03

What is Dimension Reduction?

Consider a typical machine learning situation

  • one response variable (or none - unsupervised)
  • one or more predictors (features, variables)

The penguins data:

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 39.3 20.6 190 3650 male 2007

We are interested in the dimension of the predictor space.

Think linear algebra

Dimension of the column space

An Example

Sakar et al. (2019)

  • 252 patients, 188 with a previous diagnosis of Parkinson’s disease and 64 without.
  • speech signal processing techniques yield 750 numerical predictors, or features.


Goal: classify Parkinson’s status.


Can we fit a logistic regression model?

The parkinsons design matrix, \(X\)

  • 252 rows and 750 + 1 = 751 columns (one for the intercept)

  • More columns than rows!

  • So \(X\) cannot have full column rank - a problem for linear models.
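A quick numpy check of the rank argument (toy sizes standing in for the parkinsons dimensions, not the real 252 × 751 matrix):

```python
import numpy as np

# Fewer rows (patients) than columns (features): illustrative sizes only.
rng = np.random.default_rng(0)
n, p = 10, 25
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])  # intercept + features

rank = np.linalg.matrix_rank(X)
print(rank, X.shape[1])  # rank is at most n, well below the p + 1 columns
```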

Feature selection - one form of dimension reduction

Even if \(X\) is full rank, we might want to eliminate some predictors to decrease model variance.

  • near-zero variance predictors

  • highly correlated predictors (with each other)

  • predictors uncorrelated with the response

Many methods…

  • Forward/backward stepwise selection
  • Lasso regression
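A sketch of two such selection tools in scikit-learn, on simulated data (the variable names and sizes here are made up for illustration):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X[:, 0] = 0.001 * rng.normal(size=100)               # near-zero-variance predictor
y = 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=100)

# Drop near-zero-variance columns outright
X_kept = VarianceThreshold(threshold=0.01).fit_transform(X)

# The lasso shrinks coefficients of unhelpful predictors to exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)
print(X_kept.shape, lasso.coef_)
```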

Another possibility: Feature engineering/extraction

Combine predictors into a smaller set of new predictors

More generally, can we transform and combine predictors in various ways to form a smaller set of new features?

Why do dimension reduction?

  • Reduce model variance
  • Reduce model complexity
  • Improve model interpretability
  • Reduce computational cost
  • Mitigate multicollinearity
  • Mitigate the “curse of dimensionality”

Much of this is model dependent

Also used for

  • Visualize high-dimensional data
  • data compression
  • autoencoders - representation learning

These last three are unsupervised!

Curse of dimensionality - simulation example

Curse of dimensionality - simple analytical example

Suppose KNN with \(p\) predictors \(\sim U(0,1)\)

  • volume of hypercube with sides of length \(d\) is \(d^p\).

  • To capture \(r\) proportion of the data, we want \(d^p = r\), so \(d = r^{1/p}\)

We are using the \(L_{\infty}\) norm (max norm) to define the neighborhood here, but the argument is similar for other norms.
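The formula \(d = r^{1/p}\) is easy to tabulate; even in 10 dimensions, capturing 10% of the data needs a neighborhood spanning about 80% of each axis:

```python
# Hypercube side length d = r**(1/p) needed to capture a fraction r of
# data distributed uniformly on the unit cube in p dimensions.
r = 0.10
for p in (1, 2, 10, 100):
    d = r ** (1 / p)
    print(f"p = {p:>3}: d = {d:.3f}")
```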

Features

  • predictors or variables used in a model

  • transformed and combined versions of the original predictors

Feature engineering is the art of creating useful features; it may increase or decrease the number of predictors.

Dimension reduction

Reducing the number of features/dimension of feature space

  • Feature selection - select a subset of the original predictors

  • Feature extraction - create new predictors from the original ones

    • PCA, UMAP, tSNE, neural networks/deep learning

Feature extraction

Transforming and combining predictors to form new “features”

Canonical example: Principal Components Analysis (PCA)

Principal Components Analysis (PCA)

Create features (components) from linear combinations of the predictors

  • The first component is in the direction of maximum variance of the data*.

  • The second component is in the direction of maximum variance orthogonal (and thus uncorrelated) to the first, and so on.

*It thus makes sense to standardize predictors first if they are on different scales.

PCA - the math

Predictors \(X_1, X_2, \dots, X_p\); form linear combinations

\[ Z_m = \sum_{j=1}^p \phi_{jm} X_j, \qquad m = 1, 2, \dots, p \]

In matrix form,

\[ Z = X \mathbf{U} \]

Now with data

\[ \mathbf{Z} = \mathbf{XU} \]

Principal components and eigenvectors

First scale or standardize the data matrix.

The principal component directions (the columns of \(\mathbf{U}\)) are the eigenvectors of \(\mathbf{X}^T \mathbf{X}\).

(Equivalently, the eigenvectors of the correlation matrix - in which case you don’t need to standardize first.)

Theorem

  • Eigenvectors of \(\mathbf{X}^T \mathbf{X}\) are orthogonal

  • Let \[ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p > 0 \] with eigenvectors \(\mathbf{v}_1, \dots, \mathbf{v}_p\)

    • \(\lambda_i\) is variance in direction \(\mathbf{v}_i\)
    • \(\mathbf{v}_1\) maximizes variance
    • \(\mathbf{v}_2\) maximizes variance subject to orthogonality with \(\mathbf{v}_1\)
    • and so on
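The theorem can be checked numerically with numpy on simulated correlated data (`eigh` returns eigenvalues in ascending order, so we reverse them to match the convention above):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated columns

# Standardize, then eigendecompose X^T X
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, U = np.linalg.eigh(Xs.T @ Xs)
eigvals, U = eigvals[::-1], U[:, ::-1]      # reorder to descending eigenvalues

# Eigenvectors are orthonormal: U^T U = I
print(np.allclose(U.T @ U, np.eye(4)))

# The (unnormalized) variance of each score column of Z = Xs U
# equals the corresponding eigenvalue
Z = Xs @ U
print(np.allclose((Z ** 2).sum(axis=0), eigvals))
```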

Principal components regression (PCR) in practice

  1. Center and scale the data

  2. Form \(\mathbf{U}\), the eigenvectors of the centered and scaled \(\mathbf{X}^T \mathbf{X}\), ordered by decreasing eigenvalues

  3. Compute \(\mathbf{Z} = \mathbf{XU}\)

  4. Select the number of components based on explained variance (or cross-validation)

  5. Fit a linear regression using the first \(m\) components

For \(m = p\), PCR equals ordinary least squares
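The five steps map onto a scikit-learn pipeline; a sketch on simulated data (`m` is the number of retained components):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n, p = 120, 10
X = rng.normal(size=(n, p))
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=n)

# Steps 1-5: scale, project onto the first m components, regress
m = 5
pcr = make_pipeline(StandardScaler(), PCA(n_components=m), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))                       # in-sample R^2

# With m = p, PCR reproduces ordinary least squares exactly
ols = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
pcr_full = make_pipeline(StandardScaler(), PCA(n_components=p),
                         LinearRegression()).fit(X, y)
print(np.allclose(pcr_full.predict(X), ols.predict(X)))
```

In practice step 4 (choosing \(m\)) would use cross-validation rather than the in-sample fit shown here.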

Example: meatspec data

Analytical chemistry - predict the fat content of meat samples from a 100-channel infrared absorbance spectrum.

PCA components cumulative variances

 [1] 0.98626 0.99596 0.99875 0.99990 0.99996 0.99999 0.99999 1.00000 1.00000
[10] 1.00000

Full model performance

metric estimate
rmse 0.88872
rsq 0.99511
mae 0.65249

PCR with 8 components performance

metric estimate
rmse 2.89542
rsq 0.94811
mae 2.26513

R squared vs RMSE

Source: Tidy Modeling with R, Max Kuhn and Julia Silge, 2023

Example: wine data (UCI ML Repository)

Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

Class_label Alcohol Malic_acid Ash Alcalinity_of_ash Magnesium Total_phenols Flavanoids Nonflavanoid_phenols Proanthocyanins Color_intensity Hue OD280_OD315_of_diluted_wines Proline
1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
1 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85 1450

Total and explained variance of the PCA components

Multinomial regression on the PCA features


We get a perfect score with just 5 components, and 98% accuracy with only 3 components.


num_pca_components accuracy
1 0.848315
2 0.960674
3 0.983146
4 0.983146
5 1.000000
6 1.000000
7 1.000000
8 1.000000
9 1.000000
10 1.000000
11 1.000000
12 1.000000
13 1.000000
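The wine data ships with scikit-learn, so the accuracy-by-components experiment can be reproduced in a few lines (in-sample accuracy; exact values may differ slightly from the table depending on the solver):

```python
from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True)   # 178 wines, 13 features, 3 cultivars

for m in (1, 3, 5, 13):
    clf = make_pipeline(StandardScaler(), PCA(n_components=m),
                        LogisticRegression(max_iter=5000))
    clf.fit(X, y)
    print(m, round(clf.score(X, y), 3))
```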

Loadings and scores


PCA loadings; the \(\mathbf{U}\) values

PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13
Alcohol -0.144 -0.484 -0.207 -0.018 0.266 -0.214 -0.056 -0.396 -0.509 -0.212 0.226 0.266 -0.015
Malic_acid 0.245 -0.225 0.089 0.537 -0.035 -0.537 0.421 -0.066 0.075 0.309 -0.076 -0.122 -0.026
Ash 0.002 -0.316 0.626 -0.214 0.143 -0.154 -0.149 0.170 0.308 0.027 0.499 0.050 0.141
Alcalinity_of_ash 0.239 0.011 0.612 0.061 -0.066 0.101 -0.287 -0.428 -0.200 -0.053 -0.479 0.056 -0.092
Magnesium -0.142 -0.300 0.131 -0.352 -0.727 -0.038 0.323 0.156 -0.271 -0.068 -0.071 -0.062 -0.057
Total_phenols -0.395 -0.065 0.146 0.198 0.149 0.084 -0.028 0.406 -0.286 0.320 -0.304 0.304 0.464
Flavanoids -0.423 0.003 0.151 0.152 0.109 0.019 -0.061 0.187 -0.050 0.163 0.026 0.043 -0.832
Nonflavanoid_phenols 0.299 -0.029 0.170 -0.203 0.501 0.259 0.595 0.233 -0.196 -0.216 -0.117 -0.042 -0.114
Proanthocyanins -0.313 -0.039 0.149 0.399 -0.137 0.534 0.372 -0.368 0.209 -0.134 0.237 0.096 0.117
Color_intensity 0.089 -0.530 -0.137 0.066 0.076 0.419 -0.228 0.034 -0.056 0.291 -0.032 -0.604 0.012
Hue -0.297 0.279 0.085 -0.428 0.174 -0.106 0.232 -0.437 -0.086 0.522 0.048 -0.259 0.090
OD280_OD315_of_diluted_wines -0.376 0.164 0.166 0.184 0.101 -0.266 -0.045 0.078 -0.137 -0.524 -0.046 -0.601 0.157
Proline -0.287 -0.365 -0.127 -0.232 0.158 -0.120 0.077 -0.120 0.576 -0.162 -0.539 0.079 -0.014


Cumulative normalized variances

 [1] 0.361988 0.554063 0.665300 0.735990 0.801623 0.850981 0.893368 0.920175
 [9] 0.942397 0.961697 0.979066 0.992048 1.000000


PCA scores, \(\mathbf{Z} = \mathbf{XU}\)

PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13
-3.30742 -1.439402 -0.165273 -0.215025 -0.691093 -0.223250 0.594749 0.064956 -0.639638 -1.018084 0.450293 -0.539289 0.066052
-2.20325 0.332455 -2.020757 -0.290539 0.256930 -0.924512 0.053624 1.021534 0.307978 -0.159252 0.142256 -0.387146 -0.003626
-2.50966 -1.028251 0.980054 0.722863 0.250327 0.547731 0.423012 -0.343248 1.174521 -0.113042 0.285866 -0.000582 -0.021655
-3.74650 -2.748618 -0.175696 0.566386 0.310964 0.114109 -0.382259 0.641783 -0.052397 -0.238739 -0.757448 0.241339 0.368444
-1.00607 -0.867384 2.020987 -0.408613 -0.297618 -0.405376 0.442825 0.415528 -0.325900 0.078146 0.524466 0.216055 0.079140
-3.04167 -2.116431 -0.627625 -0.514187 0.630241 0.123083 0.400524 0.393783 0.151718 0.101709 -0.404444 0.378365 -0.144747
-2.44220 -1.171545 -0.974346 -0.065645 1.024871 -0.618376 0.052742 -0.370888 0.455730 -1.013704 0.441189 -0.140833 0.271014
-2.05364 -1.604437 0.145870 -1.189253 -0.076687 -1.435756 0.032285 0.232324 -0.123023 -0.733531 -0.292729 -0.378595 0.109854
-2.50381 -0.915488 -1.765987 0.056112 0.889747 -0.128818 0.124933 -0.498173 -0.604883 -0.173617 0.507501 0.633462 -0.141684
-2.74588 -0.787217 -0.981479 0.348399 0.467235 0.162932 -0.871893 0.150156 -0.229841 -0.178915 -0.012443 -0.548779 0.042335

Visualizing the loadings

Just the first PCA component

Visualizing the loadings

First four PCA components

Example: USArrests dataset

Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7

Loadings and scores


PCA loadings; the \(\mathbf{U}\) values

PC1 PC2 PC3 PC4
Murder -0.536 -0.418 0.341 0.649
Assault -0.583 -0.188 0.268 -0.743
UrbanPop -0.278 0.873 0.378 0.134
Rape -0.543 0.167 -0.818 0.089


Cumulative normalized variances

[1] 0.620060 0.867502 0.956642 1.000000


PCA scores, \(\mathbf{Z} = \mathbf{XU}\)

PC1 PC2 PC3 PC4
Alabama -0.975660 -1.122001 0.439804 0.154697
Alaska -1.930538 -1.062427 -2.019500 -0.434175
Arizona -1.745443 0.738460 -0.054230 -0.826264
Arkansas 0.139999 -1.108542 -0.113422 -0.180974
California -2.498613 1.527427 -0.592541 -0.338559
Colorado -1.499341 0.977630 -1.084002 0.001450
Connecticut 1.344992 1.077984 0.636793 -0.117279
Delaware -0.047230 0.322089 0.711410 -0.873113
Florida -2.982760 -0.038834 0.571032 -0.095317
Georgia -1.622807 -1.266088 0.339018 1.065974

Visualizing the feature contributions in the components

Biplots

We can visualize the loadings and the scores together with a biplot.

Non-linear dimension reduction

UMAP, t-SNE, autoencoders (neural networks)
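Of these, t-SNE is available directly in scikit-learn; a minimal sketch on the small 8 × 8 digits data (a stand-in for mnist, subsampled to keep the run fast):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 64 features per image
X, y = X[:300], y[:300]               # subsample for speed

# Non-linear embedding of each image into 2-D
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)
```

UMAP works the same way via the separate umap-learn package.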

UMAP: penguins revisited

We use just the four numerical predictors

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 39.3 20.6 190 3650 male 2007

PCA vs UMAP on penguins

Penguins pairplot

UMAP may be overkill here

UMAP

The mnist dataset - images of handwritten digits (0-9), with 784 features (28 x 28 pixels)

PCA vs UMAP on mnist

Feature extraction

Transforming and combining predictors to form new “features”

Canonical example: Principal Components Analysis (PCA)

  • unsupervised
  • Partial Least Squares (PLS) - supervised

Other methods

  • Linear Discriminant Analysis (LDA)
  • t-SNE
  • UMAP
  • Autoencoders (neural networks)